Exploration server on music in Saarland

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Domain adaptation of statistical machine translation with domain-focused web crawling

Internal identifier: 000086 (Main/Exploration); previous: 000085; next: 000087

Authors: Pavel Pecina [Czech Republic]; Antonio Toral [Ireland]; Vassilis Papavassiliou [Greece]; Prokopis Prokopidis [Greece]; Aleš Tamchyna [Czech Republic]; Andy Way [Ireland]; Josef Van Genabith [Germany]

RBID : PMC:4479164

Abstract

In this paper, we tackle the problem of domain adaptation of statistical machine translation (SMT) by exploiting domain-specific data acquired by domain-focused crawling of text from the World Wide Web. We design and empirically evaluate a procedure for automatic acquisition of monolingual and parallel text and their exploitation for system training, tuning, and testing in a phrase-based SMT framework. We present a strategy for using such resources depending on their availability and quantity supported by results of a large-scale evaluation carried out for the domains of environment and labour legislation, two language pairs (English–French and English–Greek) and in both directions: into and from English. In general, machine translation systems trained and tuned on a general domain perform poorly on specific domains and we show that such systems can be adapted successfully by retuning model parameters using small amounts of parallel in-domain data, and may be further improved by using additional monolingual and parallel training data for adaptation of language and translation models. The average observed improvement in BLEU achieved is substantial at 15.30 points absolute.


DOI: 10.1007/s10579-014-9282-3
PubMed: 26120290
PubMed Central: 4479164


The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Domain adaptation of statistical machine translation with domain-focused web crawling</title>
<author>
<name sortKey="Pecina, Pavel" sort="Pecina, Pavel" uniqKey="Pecina P" first="Pavel" last="Pecina">Pavel Pecina</name>
<affiliation wicri:level="3">
<nlm:aff id="Aff1">Charles University in Prague, Prague, Czech Republic</nlm:aff>
<country xml:lang="fr">République tchèque</country>
<wicri:regionArea>Charles University in Prague, Prague</wicri:regionArea>
<placeName>
<settlement type="city">Prague</settlement>
<region type="région" nuts="2">Bohême centrale</region>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Toral, Antonio" sort="Toral, Antonio" uniqKey="Toral A" first="Antonio" last="Toral">Antonio Toral</name>
<affiliation wicri:level="1">
<nlm:aff id="Aff2">Dublin City University, Dublin, Ireland</nlm:aff>
<country xml:lang="fr">Irlande (pays)</country>
<wicri:regionArea>Dublin City University, Dublin</wicri:regionArea>
<wicri:noRegion>Dublin</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Papavassiliou, Vassilis" sort="Papavassiliou, Vassilis" uniqKey="Papavassiliou V" first="Vassilis" last="Papavassiliou">Vassilis Papavassiliou</name>
<affiliation wicri:level="3">
<nlm:aff id="Aff3">Institute for Language and Speech Processing/Athena RIC, Athens, Greece</nlm:aff>
<country xml:lang="fr">Grèce</country>
<wicri:regionArea>Institute for Language and Speech Processing/Athena RIC, Athens</wicri:regionArea>
<placeName>
<settlement type="city">Athènes</settlement>
<region nuts="2" type="region">Attique (région)</region>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Prokopidis, Prokopis" sort="Prokopidis, Prokopis" uniqKey="Prokopidis P" first="Prokopis" last="Prokopidis">Prokopis Prokopidis</name>
<affiliation wicri:level="3">
<nlm:aff id="Aff3">Institute for Language and Speech Processing/Athena RIC, Athens, Greece</nlm:aff>
<country xml:lang="fr">Grèce</country>
<wicri:regionArea>Institute for Language and Speech Processing/Athena RIC, Athens</wicri:regionArea>
<placeName>
<settlement type="city">Athènes</settlement>
<region nuts="2" type="region">Attique (région)</region>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Tamchyna, Ales" sort="Tamchyna, Ales" uniqKey="Tamchyna A" first="Aleš" last="Tamchyna">Aleš Tamchyna</name>
<affiliation wicri:level="3">
<nlm:aff id="Aff1">Charles University in Prague, Prague, Czech Republic</nlm:aff>
<country xml:lang="fr">République tchèque</country>
<wicri:regionArea>Charles University in Prague, Prague</wicri:regionArea>
<placeName>
<settlement type="city">Prague</settlement>
<region type="région" nuts="2">Bohême centrale</region>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Way, Andy" sort="Way, Andy" uniqKey="Way A" first="Andy" last="Way">Andy Way</name>
<affiliation wicri:level="1">
<nlm:aff id="Aff2">Dublin City University, Dublin, Ireland</nlm:aff>
<country xml:lang="fr">Irlande (pays)</country>
<wicri:regionArea>Dublin City University, Dublin</wicri:regionArea>
<wicri:noRegion>Dublin</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Van Genabith, Josef" sort="Van Genabith, Josef" uniqKey="Van Genabith J" first="Josef" last="Van Genabith">Josef Van Genabith</name>
<affiliation wicri:level="3">
<nlm:aff id="Aff4">Universität des Saarlandes, 66123 Saarbrücken, Germany</nlm:aff>
<country xml:lang="fr">Allemagne</country>
<wicri:regionArea>Universität des Saarlandes, 66123 Saarbrücken</wicri:regionArea>
<placeName>
<region type="land" nuts="2">Sarre (Land)</region>
<settlement type="city">Sarrebruck</settlement>
</placeName>
</affiliation>
<affiliation wicri:level="3">
<nlm:aff id="Aff5">DFKI, German Research Center for Artificial Intelligence, 66123 Saarbrücken, Germany</nlm:aff>
<country xml:lang="fr">Allemagne</country>
<wicri:regionArea>DFKI, German Research Center for Artificial Intelligence, 66123 Saarbrücken</wicri:regionArea>
<placeName>
<region type="land" nuts="2">Sarre (Land)</region>
<settlement type="city">Sarrebruck</settlement>
</placeName>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">26120290</idno>
<idno type="pmc">4479164</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4479164</idno>
<idno type="RBID">PMC:4479164</idno>
<idno type="doi">10.1007/s10579-014-9282-3</idno>
<date when="2014">2014</date>
<idno type="wicri:Area/Pmc/Corpus">000046</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000046</idno>
<idno type="wicri:Area/Pmc/Curation">000045</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Curation">000045</idno>
<idno type="wicri:Area/Pmc/Checkpoint">000068</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Checkpoint">000068</idno>
<idno type="wicri:Area/Ncbi/Merge">000124</idno>
<idno type="wicri:Area/Ncbi/Curation">000124</idno>
<idno type="wicri:Area/Ncbi/Checkpoint">000124</idno>
<idno type="wicri:doubleKey">1574-020X:2014:Pecina P:domain:adaptation:of</idno>
<idno type="wicri:Area/Main/Merge">000086</idno>
<idno type="wicri:Area/Main/Curation">000086</idno>
<idno type="wicri:Area/Main/Exploration">000086</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Domain adaptation of statistical machine translation with domain-focused web crawling</title>
<author>
<name sortKey="Pecina, Pavel" sort="Pecina, Pavel" uniqKey="Pecina P" first="Pavel" last="Pecina">Pavel Pecina</name>
<affiliation wicri:level="3">
<nlm:aff id="Aff1">Charles University in Prague, Prague, Czech Republic</nlm:aff>
<country xml:lang="fr">République tchèque</country>
<wicri:regionArea>Charles University in Prague, Prague</wicri:regionArea>
<placeName>
<settlement type="city">Prague</settlement>
<region type="région" nuts="2">Bohême centrale</region>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Toral, Antonio" sort="Toral, Antonio" uniqKey="Toral A" first="Antonio" last="Toral">Antonio Toral</name>
<affiliation wicri:level="1">
<nlm:aff id="Aff2">Dublin City University, Dublin, Ireland</nlm:aff>
<country xml:lang="fr">Irlande (pays)</country>
<wicri:regionArea>Dublin City University, Dublin</wicri:regionArea>
<wicri:noRegion>Dublin</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Papavassiliou, Vassilis" sort="Papavassiliou, Vassilis" uniqKey="Papavassiliou V" first="Vassilis" last="Papavassiliou">Vassilis Papavassiliou</name>
<affiliation wicri:level="3">
<nlm:aff id="Aff3">Institute for Language and Speech Processing/Athena RIC, Athens, Greece</nlm:aff>
<country xml:lang="fr">Grèce</country>
<wicri:regionArea>Institute for Language and Speech Processing/Athena RIC, Athens</wicri:regionArea>
<placeName>
<settlement type="city">Athènes</settlement>
<region nuts="2" type="region">Attique (région)</region>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Prokopidis, Prokopis" sort="Prokopidis, Prokopis" uniqKey="Prokopidis P" first="Prokopis" last="Prokopidis">Prokopis Prokopidis</name>
<affiliation wicri:level="3">
<nlm:aff id="Aff3">Institute for Language and Speech Processing/Athena RIC, Athens, Greece</nlm:aff>
<country xml:lang="fr">Grèce</country>
<wicri:regionArea>Institute for Language and Speech Processing/Athena RIC, Athens</wicri:regionArea>
<placeName>
<settlement type="city">Athènes</settlement>
<region nuts="2" type="region">Attique (région)</region>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Tamchyna, Ales" sort="Tamchyna, Ales" uniqKey="Tamchyna A" first="Aleš" last="Tamchyna">Aleš Tamchyna</name>
<affiliation wicri:level="3">
<nlm:aff id="Aff1">Charles University in Prague, Prague, Czech Republic</nlm:aff>
<country xml:lang="fr">République tchèque</country>
<wicri:regionArea>Charles University in Prague, Prague</wicri:regionArea>
<placeName>
<settlement type="city">Prague</settlement>
<region type="région" nuts="2">Bohême centrale</region>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Way, Andy" sort="Way, Andy" uniqKey="Way A" first="Andy" last="Way">Andy Way</name>
<affiliation wicri:level="1">
<nlm:aff id="Aff2">Dublin City University, Dublin, Ireland</nlm:aff>
<country xml:lang="fr">Irlande (pays)</country>
<wicri:regionArea>Dublin City University, Dublin</wicri:regionArea>
<wicri:noRegion>Dublin</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Van Genabith, Josef" sort="Van Genabith, Josef" uniqKey="Van Genabith J" first="Josef" last="Van Genabith">Josef Van Genabith</name>
<affiliation wicri:level="3">
<nlm:aff id="Aff4">Universität des Saarlandes, 66123 Saarbrücken, Germany</nlm:aff>
<country xml:lang="fr">Allemagne</country>
<wicri:regionArea>Universität des Saarlandes, 66123 Saarbrücken</wicri:regionArea>
<placeName>
<region type="land" nuts="2">Sarre (Land)</region>
<settlement type="city">Sarrebruck</settlement>
</placeName>
</affiliation>
<affiliation wicri:level="3">
<nlm:aff id="Aff5">DFKI, German Research Center for Artificial Intelligence, 66123 Saarbrücken, Germany</nlm:aff>
<country xml:lang="fr">Allemagne</country>
<wicri:regionArea>DFKI, German Research Center for Artificial Intelligence, 66123 Saarbrücken</wicri:regionArea>
<placeName>
<region type="land" nuts="2">Sarre (Land)</region>
<settlement type="city">Sarrebruck</settlement>
</placeName>
</affiliation>
</author>
</analytic>
<series>
<title level="j">Language Resources and Evaluation</title>
<idno type="ISSN">1574-020X</idno>
<idno type="eISSN">1574-0218</idno>
<imprint>
<date when="2014">2014</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<p>In this paper, we tackle the problem of domain adaptation of statistical machine translation (SMT) by exploiting domain-specific data acquired by domain-focused crawling of text from the World Wide Web. We design and empirically evaluate a procedure for automatic acquisition of monolingual and parallel text and their exploitation for system training, tuning, and testing in a phrase-based SMT framework. We present a strategy for using such resources depending on their availability and quantity supported by results of a large-scale evaluation carried out for the domains of environment and labour legislation, two language pairs (English–French and English–Greek) and in both directions: into and from English. In general, machine translation systems trained and tuned on a general domain perform poorly on specific domains and we show that such systems can be adapted successfully by retuning model parameters using small amounts of parallel in-domain data, and may be further improved by using additional monolingual and parallel training data for adaptation of language and translation models. The average observed improvement in BLEU achieved is substantial at 15.30 points absolute.</p>
</div>
</front>
</TEI>
<affiliations>
<list>
<country>
<li>Allemagne</li>
<li>Grèce</li>
<li>Irlande (pays)</li>
<li>République tchèque</li>
</country>
<region>
<li>Attique (région)</li>
<li>Bohême centrale</li>
<li>Sarre (Land)</li>
</region>
<settlement>
<li>Athènes</li>
<li>Prague</li>
<li>Sarrebruck</li>
</settlement>
</list>
<tree>
<country name="République tchèque">
<region name="Bohême centrale">
<name sortKey="Pecina, Pavel" sort="Pecina, Pavel" uniqKey="Pecina P" first="Pavel" last="Pecina">Pavel Pecina</name>
</region>
<name sortKey="Tamchyna, Ales" sort="Tamchyna, Ales" uniqKey="Tamchyna A" first="Aleš" last="Tamchyna">Aleš Tamchyna</name>
</country>
<country name="Irlande (pays)">
<noRegion>
<name sortKey="Toral, Antonio" sort="Toral, Antonio" uniqKey="Toral A" first="Antonio" last="Toral">Antonio Toral</name>
</noRegion>
<name sortKey="Way, Andy" sort="Way, Andy" uniqKey="Way A" first="Andy" last="Way">Andy Way</name>
</country>
<country name="Grèce">
<region name="Attique (région)">
<name sortKey="Papavassiliou, Vassilis" sort="Papavassiliou, Vassilis" uniqKey="Papavassiliou V" first="Vassilis" last="Papavassiliou">Vassilis Papavassiliou</name>
</region>
<name sortKey="Prokopidis, Prokopis" sort="Prokopidis, Prokopis" uniqKey="Prokopidis P" first="Prokopis" last="Prokopidis">Prokopis Prokopidis</name>
</country>
<country name="Allemagne">
<region name="Sarre (Land)">
<name sortKey="Van Genabith, Josef" sort="Van Genabith, Josef" uniqKey="Van Genabith J" first="Josef" last="Van Genabith">Josef Van Genabith</name>
</region>
<name sortKey="Van Genabith, Josef" sort="Van Genabith, Josef" uniqKey="Van Genabith J" first="Josef" last="Van Genabith">Josef Van Genabith</name>
</country>
</tree>
</affiliations>
</record>
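
The record above can be read programmatically, with one caveat: it uses the `wicri:` and `nlm:` prefixes without declaring them, so a strict XML parser will reject it as it stands. The following is a minimal sketch in Python, assuming the record is available as a string with no XML declaration. The function names `parse_record` and `authors_with_countries` and the placeholder namespace URIs are hypothetical, introduced only for this illustration; they are not part of Dilib or Wicri.

```python
import xml.etree.ElementTree as ET

# Assumption: placeholder URIs. The record does not declare the wicri: and
# nlm: prefixes, so arbitrary URIs are bound here only to make it parseable.
PLACEHOLDER_NS = {
    "wicri": "urn:example:wicri",
    "nlm": "urn:example:nlm",
}


def parse_record(xml_text):
    """Parse a <record> fragment whose wicri:/nlm: prefixes are undeclared.

    The fragment is wrapped in a synthetic root element that carries the
    missing namespace declarations; the <record> element is then returned.
    """
    decls = " ".join(f'xmlns:{p}="{uri}"' for p, uri in PLACEHOLDER_NS.items())
    wrapped = f"<wrapper {decls}>{xml_text}</wrapper>"
    return ET.fromstring(wrapped).find("record")


def authors_with_countries(record):
    """Return (author name, [countries]) pairs from the titleStmt block."""
    pairs = []
    for author in record.findall("./TEI/teiHeader/fileDesc/titleStmt/author"):
        name = author.findtext("name")
        countries = [c.text for c in author.findall("affiliation/country")]
        pairs.append((name, countries))
    return pairs


# A shortened excerpt of the record, reduced to one author for illustration.
SAMPLE = """<record><TEI><teiHeader><fileDesc><titleStmt>
<title xml:lang="en">Domain adaptation of statistical machine translation with domain-focused web crawling</title>
<author>
<name sortKey="Toral, Antonio" first="Antonio" last="Toral">Antonio Toral</name>
<affiliation wicri:level="1">
<nlm:aff id="Aff2">Dublin City University, Dublin, Ireland</nlm:aff>
<country xml:lang="fr">Irlande (pays)</country>
</affiliation>
</author>
</titleStmt></fileDesc></teiHeader></TEI></record>"""

record = parse_record(SAMPLE)
print(authors_with_countries(record))
# prints [('Antonio Toral', ['Irlande (pays)'])]
```

An author with several affiliations, such as Josef Van Genabith in the full record, yields one country entry per affiliation, so the same country can appear more than once in the list.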

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Wicri/Sarre/explor/MusicSarreV3/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000086 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000086 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Wicri/Sarre
   |area=    MusicSarreV3
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     PMC:4479164
   |texte=   Domain adaptation of statistical machine translation with domain-focused web crawling
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/Main/Exploration/RBID.i   -Sk "pubmed:26120290" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd   \
       | NlmPubMed2Wicri -a MusicSarreV3 

This area was generated with Dilib version V0.6.33.
Data generation: Sun Jul 15 18:16:09 2018. Site generation: Tue Mar 5 19:21:25 2024